Cache return order optimization

ABSTRACT

Improving operation of a processing unit to access data within a cache system. A first fetch request and one or more subsequent fetch requests are accessed in an instruction stream. An address of data sought by the first fetch request is obtained. At least a portion of the address of data sought by the first fetch request is inserted in each of the one or more subsequent fetch requests. The portion of the address inserted in each of the one or more subsequent fetch requests is utilized to retrieve the data sought by the first fetch request first in order from the cache system.

BACKGROUND

The present invention relates generally to the field of computer memory cache access and instruction pipelining, and more particularly to improving processor efficiency in memory caching by modifying addresses of fetch requests to maintain data return order.

BRIEF SUMMARY

Embodiments of the present invention disclose a method, system, and computer program product for improving operation of a processing unit to access data within a cache system. A first fetch request and one or more subsequent fetch requests are accessed in an instruction stream. An address of data sought by the first fetch request is obtained. At least a portion of the address of data sought by the first fetch request is inserted in each of the one or more subsequent fetch requests. The portion of the address inserted in each of the one or more subsequent fetch requests is utilized to retrieve the data sought by the first fetch request first in order from the cache system.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flowchart illustrating operational steps to maintain data return order, in an embodiment of the invention.

FIG. 2 is a functional block diagram displaying an environment for cache return order optimization, in an embodiment of the invention.

FIG. 3 displays a process of execution of cache return order optimization, in accordance with an embodiment of the invention.

DETAILED DESCRIPTION

Nearly every modern processor uses memory caching to access more frequently needed data in the fastest manner possible, rather than always directly accessing main system memory. First, second, third, and, in some processor designs, fourth and even higher level caches each present fast, progressively larger locations near the processor to store and write data. Even though each cache level is more distant from the microprocessor, all cache levels are closer, and allow faster access, than main system memory. The goal of placing the caches very close to the processor is to improve memory search times for frequently needed data, with the end result of a reduction in the time needed to execute.

Cache access, as with other computer processes, occurs via pipelined instructions executed by the processor. Each “stream” of pipelined instructions may include, for example, one or more “fetch” requests, in which data is sought from the cache, as well as various other steps. If data is sought from a certain cache during a fetch request, but is not found, a search in a higher level cache may be scheduled by the processor during a subsequent “fetch” request.
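
For illustration only (this is a sketch, not the claimed mechanism), a fetch falling through successive cache levels can be pictured as follows; the dict-based caches and the function name are hypothetical:

```python
def fetch(address, cache_levels, main_memory):
    """Probe each cache level in order; fall back to main memory.

    cache_levels is a list of dict-like caches, lowest (fastest)
    level first; a miss at one level leads to a lookup at the next
    higher level, mirroring the scheduling described above.
    """
    for level, cache in enumerate(cache_levels, start=1):
        if address in cache:
            return f"L{level} hit", cache[address]
    return "memory", main_memory[address]
```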

The stream of pipelined instructions is broken up by a processing unit into multiple independent fetch requests as it is executed, but the data requested by each independent fetch request may exist in different levels of the cache system. It is possible, therefore, that the first fetch request in the stream of pipelined instructions seeks data existing in a higher level of the cache, and, for various reasons, access to this cache level occurs at a later time than access to a lower cache level where the remainder of the data sought by subsequent fetch requests resides. This causes the data return order to become skewed, which leads to slowdowns and even stalls in program execution while waiting for the data sought by an early fetch request to be returned.

Presented is a system, method, and computer program product for improving operation of the processing unit by maintaining data return order when executing multiple fetch requests from a multilevel cache.

FIG. 1 is a flowchart illustrating operational steps to maintain data return order, in an embodiment of the invention. At step 100, execution begins. At step 110, a first fetch request and one or more subsequent fetch requests are accessed from an instruction stream. At step 120, an address of data sought by the first fetch request is obtained for further use. For simplicity's sake, shortened addresses such as x40 are used throughout. At step 130, at least a portion of the address of the data sought by the first fetch request is inserted in each of the one or more subsequent fetch requests. At step 140, a determination is made that the data sought by the first fetch request is not found in a first level cache associated with the processing unit.
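
A minimal sketch of steps 110 through 140, under the assumption of hypothetical fetch-request objects carrying an `address` field and a directory object with a `lookup` method; the mask value is likewise illustrative:

```python
SECTOR_MASK = 0xFF  # illustrative mask for the shortened addresses (e.g. x40)

def prepare_fetch_stream(fetch_requests, l1_directory):
    first, *subsequent = fetch_requests            # step 110
    target = first.address                         # step 120
    for req in subsequent:                         # step 130: insert portion
        req.inserted_bits = target & SECTOR_MASK   # of the first address
    l1_hit = l1_directory.lookup(target)           # step 140: hit or miss?
    return first, subsequent, l1_hit
```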

At step 150, an attempt to access a translation look aside buffer is made to determine an absolute address of the data sought in a higher level cache or system memory. Alternately, and depending upon the location of the data and system architecture, a logical address may be utilized in seeking the same data. At step 160, an attempt to access the translation look aside buffer fails to arbitrate access. At step 170, and as execution of pipeline stages proceeds, the portion of the address inserted into each of the one or more subsequent fetch requests is utilized to confirm that the data requested by the first fetch request is actually retrieved first. Although simplified addresses such as x40 are used herein, in practice a few low order bits of the address responsible for data return may be inserted. Execution may then proceed to return subsequent data to transfer to a cache line in the first level cache, with the full cache line of data utilized in subsequent execution by the processing unit. The advantage of the present invention is the returning of the first requested data first in time, so as to allow execution to proceed in the fastest manner possible. Execution proceeds to end 190, but may restart immediately and return to start 100.
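
One way to picture the confirmation at step 170, as a sketch only, is a reordering keyed on the inserted low order bits; sector addresses are plain integers here, and the function name is hypothetical:

```python
def confirm_first_return(returned_sectors, inserted_bits):
    # Step 170, sketched: rotate the returned sector addresses so the
    # sector whose low order bits match the inserted bits (the data of
    # the first fetch request) is delivered first.
    offsets = [sector & 0xFF for sector in returned_sectors]
    start = offsets.index(inserted_bits)
    return returned_sectors[start:] + returned_sectors[:start]
```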

FIG. 2 is a functional block diagram displaying an environment for cache return order optimization, in an embodiment of the invention. The processing unit is displayed 200. Displayed also is an L1 cache directory 210, translation look aside buffer 2 (“TLB2”) 220, and translation look aside buffer 3 (“TLB3”) 230 for providing access, respectively, to an L1 cache 240, an L2 cache 250, and an L3 cache 260. The processing unit 200 utilizes for fast access the L1 cache 240, the L2 cache 250, and the L3 cache 260 (versus direct accesses to system memory). The L1 cache 240 is closest to the processing unit 200, with the fastest data transfer rate between the L1 cache 240 and the processing unit 200, but with the smallest amount of storage space available. The L2 cache 250 is an intermediate distance from the processing unit 200, with an intermediate data transfer rate between the L2 cache 250 and the processing unit 200, and an intermediate amount of storage space available. The L3 cache 260 is the farthest distance from the processing unit 200, with the slowest data transfer rate between the processing unit 200 and the L3 cache 260, but the largest amount of storage space available.
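
The relative ordering described above might be tabulated as follows; the specific latency and size figures are purely hypothetical and not taken from the disclosure:

```python
# Purely hypothetical numbers; only the ordering matters (smaller and
# faster nearest the processing unit 200, larger and slower farther away).
CACHE_HIERARCHY = (
    ("L1 cache 240", {"latency_cycles": 4,  "size_kib": 64}),
    ("L2 cache 250", {"latency_cycles": 12, "size_kib": 512}),
    ("L3 cache 260", {"latency_cycles": 40, "size_kib": 8192}),
)
```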

The L1 cache directory 210 provides access to information residing in the L1 cache 240. The L1 cache directory 210 also identifies whether data is present within the L1 cache 240 (i.e., whether a fetch request will “hit” or “miss”). If data is not present to respond to a fetch attempted from the processing unit 200, the L1 cache directory 210 may identify which higher level cache to access of the L2 cache 250 and the L3 cache 260. Data in the L1 cache 240 is accessible via logical addresses (also referred to as “virtual addresses”), and may operate on cache line granularity. Translation look aside buffer 2 220 provides access to information residing in the L2 cache 250. Translation look aside buffer 3 230 provides access to information residing in the L3 cache 260. Data in the L2 cache 250 and the L3 cache 260 are accessible via absolute addresses (also referred to as “physical addresses”). The translation look aside buffer 2 220 and translation look aside buffer 3 230 service many potential requestors, including the processing unit 200, an instruction cache, multiple load-store units, etc., but are only single-ported and so cannot service multiple requests simultaneously. Access to the TLB2 220 and TLB3 230 must therefore be successfully “arbitrated” by the processing unit 200 in order to process a fetch request.
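
The single-ported behavior can be sketched as follows; the class and method names are illustrative, not part of the disclosure:

```python
class SinglePortedTLB:
    """Sketch of a single-ported structure: only one requestor is
    granted the port per cycle; all others fail arbitration and
    must retry in a later cycle."""

    def __init__(self):
        self.granted_to = None

    def arbitrate(self, requestor):
        if self.granted_to is not None:
            return False             # port already taken: arbitration fails
        self.granted_to = requestor  # port granted for this cycle
        return True

    def end_cycle(self):
        self.granted_to = None       # free the port for the next cycle
```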

When accessing and executing a first fetch request and subsequent fetch requests, the processing unit 200 first accesses the L1 cache directory 210 to see if the data is present in the L1 cache 240. If the processing unit 200 determines the data requested by the first fetch request is not present in the L1 cache 240, the L1 cache directory 210 may indicate that the data is located in the L2 cache 250 or the L3 cache 260. The processing unit 200 may then access TLB2 220 or TLB3 230 to determine the address of the data.

If the data is located in the L2 cache 250 or L3 cache 260, the data requested by the fetch request may be loaded after retrieval into a cache line of the L1 cache 240, with an entire line being loaded into the cache line of the L1 cache 240 to support fast continuation (as requested by one or more subsequent fetch requests), although only the data requested by the first fetch request is truly necessary. Data following the first fetch request is loaded in the predetermined order, with “wrap-around” supported to fill the cache line of the L1 cache 240. If, for example, data at x80 was requested in the first fetch request, then x90, xA0, xB0, xC0, xD0, xE0, xF0, x00, x10, x20, x30, x40, x50, x60, x70 are subsequently requested by subsequent fetch requests (with wrap-around occurring at x00 . . . ) to fill the cache line of the L1 cache 240, and support fast availability of potentially useful data in the vicinity of the first fetch request at x80 via the L1 cache 240 after retrieval.
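
The wrap-around fill order can be computed mechanically. A sketch, assuming 16-byte sectors and a 256-byte cache line (both assumptions, chosen only to match the shortened x00-xF0 addresses above):

```python
def wraparound_fill_order(first_sector, line_bytes=0x100, sector_bytes=0x10):
    # Walk forward from the critical sector and wrap at the cache-line
    # boundary, e.g. x80 -> x90 -> ... -> xF0 -> x00 -> ... -> x70.
    base = first_sector - (first_sector % line_bytes)
    offset = first_sector % line_bytes
    return [base + (offset + i * sector_bytes) % line_bytes
            for i in range(line_bytes // sector_bytes)]

print([hex(a) for a in wraparound_fill_order(0x80)])
# ['0x80', '0x90', ..., '0xf0', '0x0', '0x10', ..., '0x70']
```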

Depending upon how the first fetch request and one or more subsequent fetch requests in an instruction stream are processed by the TLB2 220 and TLB3 230, and where arbitration is granted first for the subsequent fetch requests (based upon other requests from the load-store unit, etc.), it is possible that information requested in subsequent fetch requests is loaded first into the L1 cache 240, while the data sought by the first fetch request, which was truly necessary to begin with, is still awaited. By inserting at least a portion of the address of the data sought by the first fetch request in each of the subsequent fetch requests, the presently disclosed invention confirms that the data in the first fetch request and subsequent requests is loaded in the correct order, to maximize speed and minimize power consumption by the processing unit 200. This process is further discussed in connection with FIG. 3.
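
One way to realize this, sketched under the assumption that the inserted bits simply replace a subsequent request's own low order bits when it wins arbitration first (the field and function names are hypothetical):

```python
def effective_fetch_address(request):
    # If this subsequent request carries inserted bits from the first
    # fetch request, use them in place of its own low order bits, so
    # that whichever request wins arbitration first fetches the
    # first-requested data.
    inserted = getattr(request, "inserted_bits", None)
    if inserted is not None:
        return (request.address & ~0xFF) | inserted
    return request.address
```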

FIG. 3 displays a process of execution of cache return order optimization, in accordance with an embodiment of the invention. Displayed 410 is a process of execution without the benefit of the presently disclosed invention. As displayed 411-419, a first fetch request 1 411 and subsequent fetch requests 412, 413, 415, 417, and 419 are being executed. The first fetch request 1 411 is requesting data at address x40. Subsequent fetch request 2 412 is requesting data at address x60. Subsequent fetch request 3 413 is requesting data at address x80. Subsequent fetch request 4 415 is requesting data at xA0. Subsequent fetch request 5 417 is requesting data at xC0. Subsequent fetch request 6 419 is requesting data at xE0.

In this process of execution 410, fetch requests are executed at instruction cycle i2. First fetch request 1 411 would execute at instruction cycle i2 (431), but due to not arbitrating access to a translation look aside buffer, the first fetch request 1 411 is not executed. Subsequent fetch request 2 412 would execute at instruction cycle i2 (433), but again due to not arbitrating access to the translation look aside buffer, subsequent fetch request 2 412 is not executed. Subsequent fetch request 3 413 executes at instruction cycle i2 (435), and successfully arbitrates access to the translation look aside buffer. Data at address x80 is therefore successfully returned by subsequent fetch request 3 413. Subsequent fetches 415, 417, and 419 successfully arbitrate access to the translation look aside buffer, and wrap-around occurs to return the remainder of the data.

The data return order is displayed 425, as x80, x90, xA0, xB0, xC0, xD0, xE0, xF0, x00, x10, x20, x30, x40, x50, x60, x70. Thus, even though the first fetch request 1 411 was for data at x40, the first data returned is at x80, as displayed 427. The data at x40 was returned 429, much later, after receiving a restart indication (not shown here). Multiple fetches and other instructions may have been executed in the interim, leading to systematic delays, stalls, and possibly increased power consumption due to not receiving the first requested data x40 first.
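
Continuing the wraparound_fill_order sketch from above (illustration only), the baseline order 425 follows directly once fetch request 3 413 is the arbitration winner:

```python
baseline = wraparound_fill_order(0x80)  # fetch 3 413 won arbitration first
assert baseline[0] == 0x80              # x80 is returned first...
assert baseline.index(0x40) == 12       # ...while x40, wanted first, is late
```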

Displayed 460 is a process of execution with the benefit of the presently disclosed invention. As displayed 461-469, a first fetch request 1 461 and subsequent fetch requests 462, 463, 465, 467, and 469 are being executed. The first fetch request 1 461 is requesting data at address x40. Since the first fetch request 1 461 is requesting data at x40, in the presently disclosed invention, the address of data sought by the first fetch request 1 461 is inserted in all subsequent fetch requests 462, 463, 465, 467, 469 to maintain order of the retrieved data when retrieving from higher level caches such as the L2 cache 250 or L3 cache 260. Subsequent fetch request 2 462 is requesting data at address x60. Subsequent fetch request 3 463 is requesting data at address x80. Subsequent fetch request 4 465 is requesting data at xA0. Subsequent fetch request 5 467 is requesting data at xC0. Subsequent fetch request 6 469 is requesting data at xE0. Address insert x40 remains associated with all subsequent fetch requests 462-469, to ascertain that x40 is retrieved first.

In this process of execution 460, as previously with 410, fetch requests are executed at instruction cycle i2. First fetch request 1 461 would execute at instruction cycle i2 (481), but due to not arbitrating access to a translation look aside buffer, the first fetch request 1 461 is not executed. Subsequent fetch request 2 462 would execute at instruction cycle i2 (483), but again due to not arbitrating access to the translation look aside buffer, subsequent fetch request 2 462 is not executed. Subsequent fetch request 3 463 executes at instruction cycle i2 (485), and successfully arbitrates access to the translation look aside buffer. Data at address x40 is successfully returned by subsequent fetch request 3 463, since the address insert is utilized in fetch request 3 463. Subsequent fetches 465, 467, and 469 successfully arbitrate access to the translation look aside buffer, and wrap-around occurs to return the remainder of the data.

The data return order is displayed 475, as x40, x50, x60, x70, x80, x90, xA0, xB0, xC0, xD0, xE0, xF0, x00, x10, x20, x30. This is the correct order, with the first requested data x40 returned first in the data return order 475, as displayed 477. The fewest possible number of fetches has been executed, leading to a minimum of system delay. The returned data may be utilized in a line of the L1 cache 240, or in another way based upon system architecture.
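
Again continuing the wraparound_fill_order sketch (illustration only), the improved order 475 is what results once the inserted bits steer the arbitration winner to x40:

```python
improved = wraparound_fill_order(0x40)  # inserted bits steer the winner to x40
assert improved[0] == 0x40              # first-requested data returns first
assert improved == [0x40, 0x50, 0x60, 0x70, 0x80, 0x90, 0xA0, 0xB0,
                    0xC0, 0xD0, 0xE0, 0xF0, 0x00, 0x10, 0x20, 0x30]
```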

Based on the foregoing, a computer system, method and program product have been disclosed for cache return order optimization. However, numerous modifications and substitutions can be made without deviating from the scope of the present invention. The embodiment(s) herein may be combined, altered, or have portions removed. Therefore, the present invention has been disclosed by way of example and not limitation.

1. A method of improving operation of a processing unit to access data within a cache system, the method comprising: receiving by a processing unit a stream of pipelined instructions including one or more fetch requests seeking data from a multilevel cache or system memory; breaking up by the processing unit the stream of pipelined instructions during execution into multiple independent fetch requests including a first fetch request and one or more subsequent fetch requests each seeking data existing in different levels of the multilevel cache system or system memory; accessing by the processing unit the first fetch request and one or more subsequent fetch requests; obtaining by the processing unit a memory address of data sought by the first fetch request, the obtained memory address being a logical address where the data is located in the multilevel cache system or system memory; inserting low order bits of the memory address of data sought from the cache system or main system memory by the first fetch request in each of the one or more subsequent fetch requests seeking data located in the cache system or main system memory; determining, by accessing a first level cache directory indicating data present in the first level cache, that data sought by the first fetch request is not found in the first level cache of the multilevel cache system; attempting to access a translation look aside buffer providing access to information residing in a cache level of the multilevel cache system higher than the first level cache to determine an absolute address or a physical address of the data sought in a higher level cache or system memory, the attempt to access the translation look aside buffer failing to arbitrate access; and utilizing the low order bits portion of the memory address inserted in each of the one or more subsequent fetch requests in place of the memory address contained within each of the one or more subsequent fetch requests to retrieve the data sought by the first fetch request.